Rectangular data
File formats: Excel (.xls), SPSS (.sav), STATA (.dat), etc.
This question checks your understanding of the coercion of boolean values.
What is the output of the following code?
my_vector <- c(1, 0, 3, -1)
as.numeric(my_vector > 0)
c("TRUE", "FALSE", "TRUE", "FALSE")
c(TRUE, FALSE, TRUE, FALSE)
c(1,0,1,0)
c(0,1,0,1)
Consider the data frame
dataCHframe <- data.frame(
"City" = c("St.Gallen", "Lausanne", "Zürich"),
"PartyLeft" = c(35, 45, 55),
"PartyRight" = c(40, 35, 30)
)
Which of these statements are TRUE?
dataCHframe$PartyCenter <- c(25, 20, 15) creates a new variable called “PartyCenter”.
dim(dataCHframe[, dataCHframe$PartyLeft > 40]) returns the same as dim(dataCHframe[, c(2,3)]).
dim(dataCHframe[dataCHframe$PartyLeft > 40 | dataCHframe$PartyLeft < 40, ]) returns c(3,3).
dataCHframe is a list consisting of one named character vector and two named integer vectors.
You want to import a file using read_delim(). Describe what read_delim() does under the hood. What should be added to this command in order for it to work?
Consider the code
df <- data.frame(a = c(1,2,3,4),
b = c("au", "de", "ch", "li"))
Are these statements TRUE or FALSE?
mean(df$a) == 2.5
typeof(as.matrix(df)[,1]) is numeric (or double)
Statement for illustrative purposes (just for you to see it once, not that I expect you to learn this):
as_tibble(df)[1:2,1] contains the same information as df[1:2, 1] but the formatting is different.
as_tibble(df)[1:2,1]
## # A tibble: 2 × 1
##       a
##   <dbl>
## 1     1
## 2     2
df[1:2, 1]
## [1] 1 2
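The formatting differs because base R drops a single selected column to a plain vector, while a tibble stays a tibble. A minimal base-R sketch using the same df; the names as_vector and as_frame are introduced here for illustration only:

```r
# same data frame as in the question above
df <- data.frame(a = c(1, 2, 3, 4),
                 b = c("au", "de", "ch", "li"))

# default: selecting a single column drops the data frame structure
as_vector <- df[1:2, 1]
class(as_vector)

# drop = FALSE keeps the result rectangular (a data frame with one column)
as_frame <- df[1:2, 1, drop = FALSE]
class(as_frame)
dim(as_frame)
```

With drop = FALSE, the base data frame behaves like the tibble in this respect.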
Rectangular data
Non-rectangular data
father  mother  name      age  gender
                John       33  male
                Julia      32  female
John    Julia   Jack        6  male
John    Julia   Jill        4  female
John    Julia   John jnr    2  male
                David      45  male
                Debbie     42  female
David   Debbie  Donald     16  male
David   Debbie  Dianne     12  female
unique_id,indicator_id,name,measure,measure_info,geo_type_name,geo_join_id,geo_place_name,time_period,start_date,data_value
216498,386,Ozone (O3),Mean,ppb,CD,313,Coney Island (CD13),Summer 2013,2013-06-01T00:00:00,34.64
216499,386,Ozone (O3),Mean,ppb,CD,313,Coney Island (CD13),Summer 2014,2014-06-01T00:00:00,33.22
219969,386,Ozone (O3),Mean,ppb,Borough,1,Bronx,Summer 2013,2013-06-01T00:00:00,31.25
<row>
<unique_id>216498</unique_id>
<indicator_id>386</indicator_id>
<name>Ozone (O3)</name>
<measure>Mean</measure>
<measure_info>ppb</measure_info>
<geo_type_name>CD</geo_type_name>
<geo_join_id>313</geo_join_id>
<geo_place_name>Coney Island (CD13)</geo_place_name>
<time_period>Summer 2013</time_period>
<start_date>2013-06-01T00:00:00</start_date>
<data_value>34.64</data_value>
</row>
<row>
<unique_id>216499</unique_id>
<indicator_id>386</indicator_id>
...
</row>
The actual content we know from the csv-type example above is nested between the ‘row’-tags:
<row> ... </row>
<root>
<child>
<subchild>.....</subchild>
</child>
</root>
There are two principal ways to link variable names to values.
<variable>Monthly Surface Clear-sky Temperature (ISCCP) (Celsius)</variable>
<filename>ISCCPMonthly_avg.nc</filename>
<filepath>/usr/local/fer_data/data/</filepath>
<badflag>-1.E+34</badflag>
<subset>48 points (TIME)</subset>
<longitude>123.8W(-123.8)</longitude>
<latitude>48.8S</latitude>
<case date="16-JAN-1994" temperature="9.200012" />
<case date="16-FEB-1994" temperature="10.70001" />
<case date="16-MAR-1994" temperature="7.5" />
<case date="16-APR-1994" temperature="8.100006" />
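Tag-based: the variable name is the tag and the value is its content, e.g. <filename>ISCCPMonthly_avg.nc</filename>.
Attribute-based: variable names and values are attributes of a single tag, e.g. <case date="16-JAN-1994" temperature="9.200012" />.
Attributes-based: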
<case date="16-JAN-1994" temperature="9.200012" />
<case date="16-FEB-1994" temperature="10.70001" />
<case date="16-MAR-1994" temperature="7.5" />
<case date="16-APR-1994" temperature="8.100006" />
Tag-based:
<cases>
<case>
<date>16-JAN-1994</date>
<temperature>9.200012</temperature>
</case>
<case>
<date>16-FEB-1994</date>
<temperature>10.70001</temperature>
</case>
<case>
<date>16-MAR-1994</date>
<temperature>7.5</temperature>
</case>
<case>
<date>16-APR-1994</date>
<temperature>8.100006</temperature>
</case>
</cases>
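Both encodings carry the same information; only the extraction step differs. The following is a minimal sketch with xml2, assuming the attribute-based cases are wrapped in a <cases> root node (an assumption, since the excerpt above shows no root for them). The object names attr_df and tag_df are illustrative.

```r
library(xml2)

# attribute-based: values are stored as attributes of <case>
# (the <cases> root is an assumption added to make the document well-formed)
attr_xml <- read_xml('<cases>
  <case date="16-JAN-1994" temperature="9.200012" />
  <case date="16-FEB-1994" temperature="10.70001" />
</cases>')
attr_cases <- xml_find_all(attr_xml, ".//case")
attr_df <- data.frame(
  date = xml_attr(attr_cases, "date"),
  temperature = as.numeric(xml_attr(attr_cases, "temperature"))
)

# tag-based: values are the text content of child nodes
tag_xml <- read_xml('<cases>
  <case><date>16-JAN-1994</date><temperature>9.200012</temperature></case>
  <case><date>16-FEB-1994</date><temperature>10.70001</temperature></case>
</cases>')
tag_df <- data.frame(
  date = xml_text(xml_find_all(tag_xml, ".//date")),
  temperature = as.numeric(xml_text(xml_find_all(tag_xml, ".//temperature")))
)

# compare the two resulting data frames
all.equal(attr_df, tag_df)
```

xml_attr() reads attribute values, xml_text() reads node content; both return character vectors, so numeric values need an explicit as.numeric().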
<?xml version="1.0" encoding="UTF-8"?>
<customers>
<person>
<name>Michael Scott</name>
<orders>
<product> x </product>
<product> y </product>
</orders>
</person>
<person>
<name>Dwight Schrutte</name>
<orders>
<product> a </product>
<product> x </product>
</orders>
</person>
</customers>
# load packages
library(xml2)
# parse XML, represent XML document as R object
xml_doc <- read_xml("customers.xml")
xml_doc
## {xml_document}
## <customers>
## [1] <person>\n <name>Michael Scott</name>\n <orders>\n <product> x </product>\n <product> ...
## [2] <person>\n <name>Dwight Schrutte</name>\n <orders>\n <product> a </product>\n <produc ...
Some code uses the package XML instead of xml2. Both parse XML documents, but xml2 is more recent and faster.
‘customers’ is the root node, the ‘person’ nodes are its children:
# navigate downwards
persons <- xml_children(xml_doc)
persons
## {xml_nodeset (2)}
## [1] <person>\n <name>Michael Scott</name>\n <orders>\n <product> x </product>\n <product> ...
## [2] <person>\n <name>Dwight Schrutte</name>\n <orders>\n <product> a </product>\n <produc ...
Navigate sidewards and upwards
# navigate sidewards
persons[1]
## {xml_nodeset (1)}
## [1] <person>\n <name>Michael Scott</name>\n <orders>\n <product> x </product>\n <product> ...
xml_siblings(persons[[1]])
## {xml_nodeset (1)}
## [1] <person>\n <name>Dwight Schrutte</name>\n <orders>\n <product> a </product>\n <produc ...
# navigate upwards
xml_parents(persons)
## {xml_nodeset (1)}
## [1] <customers>\n <person>\n <name>Michael Scott</name>\n <orders>\n <product> x </pr ...
Extract specific parts of the data:
# find data via XPath
customer_names <- xml_find_all(xml_doc, xpath = ".//name")
customer_names
## {xml_nodeset (2)}
## [1] <name>Michael Scott</name>
## [2] <name>Dwight Schrutte</name>
# extract the data as text
xml_text(customer_names)
## [1] "Michael Scott" "Dwight Schrutte"
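The same navigation and XPath tools can turn the nested document into rectangular data. The sketch below builds one row per ordered product; the XML is passed as a string instead of being read from customers.xml so that the example is self-contained, and the names rows and orders are illustrative.

```r
library(xml2)

# the customers document from above, passed as a literal string
xml_doc <- read_xml('<customers>
  <person>
    <name>Michael Scott</name>
    <orders><product> x </product><product> y </product></orders>
  </person>
  <person>
    <name>Dwight Schrutte</name>
    <orders><product> a </product><product> x </product></orders>
  </person>
</customers>')

persons <- xml_children(xml_doc)

# for each person, pair the name with each of that person's products
rows <- lapply(persons, function(p) {
  data.frame(
    name = xml_text(xml_find_first(p, "./name")),
    product = trimws(xml_text(xml_find_all(p, "./orders/product")))
  )
})

# stack the per-person data frames into one table
orders <- do.call(rbind, rows)
orders
```

The relative XPath expressions (starting with ./) are evaluated from each person node, which keeps every product attached to the right customer. trimws() removes the blanks around the product values.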
XML:
<person>
<firstName>John</firstName>
<lastName>Smith</lastName>
<age>25</age>
<address>
<streetAddress>21 2nd Street</streetAddress>
<city>New York</city>
<state>NY</state>
<postalCode>10021</postalCode>
</address>
<phoneNumber>
<type>home</type>
<number>212 555-1234</number>
</phoneNumber>
<phoneNumber>
<type>fax</type>
<number>646 555-4567</number>
</phoneNumber>
<gender>
<type>male</type>
</gender>
</person>
JSON:
{"firstName": "John",
"lastName": "Smith",
"age": 25,
"address": {
"streetAddress": "21 2nd Street",
"city": "New York",
"state": "NY",
"postalCode": "10021"
},
"phoneNumber": [
{
"type": "home",
"number": "212 555-1234"
},
{
"type": "fax",
"number": "646 555-4567"
}
],
"gender": {
"type": "male"
}
}
XML:
<person> <firstName>John</firstName> <lastName>Smith</lastName> </person>
JSON:
{"firstName": "John",
"lastName": "Smith"
}
# load packages
library(jsonlite)
# parse the JSON-document shown in the example above
json_doc <- fromJSON("data/person.json")
# look at the structure of the document
str(json_doc)
## List of 6
##  $ firstName  : chr "John"
##  $ lastName   : chr "Smith"
##  $ age        : int 25
##  $ address    :List of 4
##   ..$ streetAddress: chr "21 2nd Street"
##   ..$ city         : chr "New York"
##   ..$ state        : chr "NY"
##   ..$ postalCode   : chr "10021"
##  $ phoneNumber:'data.frame': 2 obs. of 2 variables:
##   ..$ type  : chr [1:2] "home" "fax"
##   ..$ number: chr [1:2] "212 555-1234" "646 555-4567"
##  $ gender     :List of 1
##   ..$ type: chr "male"
The nesting structure is represented as a nested list:
# navigate the nested lists, extract data
# extract the address part
json_doc$address
## $streetAddress
## [1] "21 2nd Street"
##
## $city
## [1] "New York"
##
## $state
## [1] "NY"
##
## $postalCode
## [1] "10021"
# extract the gender (type)
json_doc$gender$type
## [1] "male"
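As the str() output shows, fromJSON() simplifies the array of phone objects into a data frame, so the usual rectangular tools apply to that part of the document. A minimal sketch, with a shortened version of the JSON passed as a string rather than read from data/person.json; the names json_txt, phones and fax_number are illustrative.

```r
library(jsonlite)

# shortened version of the person document, passed as a literal string
json_txt <- '{"firstName": "John", "lastName": "Smith",
  "phoneNumber": [
    {"type": "home", "number": "212 555-1234"},
    {"type": "fax",  "number": "646 555-4567"}
  ]}'
json_doc <- fromJSON(json_txt)

# the array of objects arrives as a data frame
phones <- json_doc$phoneNumber
is.data.frame(phones)

# logical row filtering works as on any data frame
fax_number <- phones$number[phones$type == "fax"]
fax_number
```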
HyperText Markup Language (HTML), designed to be read by a web browser.
HTML documents/webpages consist of ‘semi-structured data’:
<!DOCTYPE html>
<html>
<head>
<title>hello, world</title>
</head>
<body>
<h2> hello, world </h2>
</body>
</html>
In this example, we look at Wikipedia’s Economy of Switzerland page.
-> Exercise session this afternoon!
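Because HTML is tree-structured like XML, the small hello-world page shown earlier can be navigated with the same xml2 functions. This is a minimal sketch on that page only, not the Wikipedia example; the names page_title and heading are illustrative.

```r
library(xml2)

# the small hello-world page, passed as a literal string
html_doc <- read_html('<!DOCTYPE html>
<html>
  <head><title>hello, world</title></head>
  <body><h2> hello, world </h2></body>
</html>')

# locate nodes via XPath and extract their text
page_title <- xml_text(xml_find_first(html_doc, "//title"))
heading <- trimws(xml_text(xml_find_first(html_doc, "//h2")))

page_title
heading
```

Real webpages are far messier than this one, which is why finding the right XPath expression is usually the main effort.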
Text is unstructured data. Text analysis and feature extraction are the basis for new genAI models!
-> check the code example on Canvas.
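One very simple form of feature extraction is to count words per document, which turns unstructured text into a rectangular document-term matrix. The sketch below uses base R only and two made-up sentences (hypothetical, not from the course material); it is an illustration, not the code example on Canvas.

```r
# hypothetical example sentences (not from the course material)
docs <- c("R is great for data", "data is the new oil")

# tokenize: lower-case, then split on whitespace
tokens <- strsplit(tolower(docs), "\\s+")

# vocabulary: all distinct words, sorted
vocab <- sort(unique(unlist(tokens)))

# count how often each vocabulary word occurs in each document
counts <- vapply(
  tokens,
  function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
  integer(length(vocab))
)

# document-term matrix: one row per document, one column per word
dtm <- t(counts)
colnames(dtm) <- vocab
dtm
```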